Method and apparatus using spectral addition for speaker recognition

ABSTRACT

A method and apparatus for speaker recognition is provided that matches the noise in training data to noise in testing data using spectral addition. Under spectral addition, the mean and variance for a plurality of frequency components are adjusted in the training data and the test data so that each mean and variance is matched in a resulting matched training signal and matched test signal. The adjustments made to the training data and test data add to the mean and variance of the training data and test data instead of subtracting from the mean and variance.

This application is a divisional of and claims priority from U.S. patent application Ser. No. 11/065,573 filed on Feb. 24, 2005, which was a divisional of and claims priority from U.S. patent application Ser. No. 09/685,534 filed on Oct. 10, 2000.

BACKGROUND OF THE INVENTION

The present invention relates to speaker recognition. In particular, the present invention relates to training and using models for speaker recognition.

A speaker recognition system identifies a person from their speech. Such systems can be used to control access to areas or computer systems as well as to tailor computer settings for a particular person.

In many speaker recognition systems, the system asks the user to repeat a phrase that will be used for recognition. The speech signal that is generated while the user is repeating the phrase is then used to train a model. When a user later wants to be identified by their speech, they repeat the identification phrase. The resulting speech signal, sometimes referred to as a test signal, is then applied against the model to generate a probability that the test signal was generated by the same person who produced the training signals.

The generated probability can then be compared to other probabilities that are generated by applying the test signal to other models. The model that produces the highest probability is then considered to have been produced by the same speaker who generated the test signal. In other systems, the probability is compared to a threshold probability to determine if the probability is sufficiently high to identify the person as the same person who trained the model. Another type of system would compare the probability to the probability of a general model designed to represent all speakers.

The performance of speaker recognition systems is affected by the amount and type of background noise in the test and training signals. In particular, the performance of these systems is negatively impacted when the background noise in the training signal is different from the background noise in the test signal. This is referred to as having mismatched signals, which generally provides lower accuracy than having so-called matched training and testing signals.

To overcome this problem, the prior art has attempted to match the noise in the training signal to the noise in the testing signal. Under some systems, this is done using a technique known as spectral subtraction. In spectral subtraction, the systems attempt to remove as much noise as possible from both the training signal and the test signal. To remove the noise from the training signal, the systems first collect noise samples during pauses in the speech found in the training signal. From these samples, the mean of each frequency component of the noise is determined. Each frequency mean is then subtracted from the remaining training speech signal. A similar procedure is followed for the test signal, by determining the mean strength of the frequency components of the noise in the test signal.

Spectral subtraction is less than ideal as a noise matching technique. First, spectral subtraction does not remove all noise from the signals. As such, some noise remains mismatched. In addition, because spectral subtraction performs a subtraction, it is possible for it to generate a training signal or a test signal that has a negative strength for a particular frequency. To avoid this, many spectral subtraction techniques abandon the subtraction when the subtraction will result in negative strength, using a flooring technique instead. In those cases, the spectral subtraction technique is replaced with a technique of attenuating the particular frequency.

For these reasons, a new noise matching technique for speaker recognition is needed.

SUMMARY OF THE INVENTION

A method and apparatus for speaker recognition is provided that matches the noise in training data to the noise in testing data using spectral addition. Under spectral addition, the mean and variance for a plurality of frequency components are adjusted in the training data and the test data so that each mean and variance is matched in a resulting matched training signal and matched test signal. The adjustments made to the training data and test data add to the mean and variance of the training data and test data instead of subtracting from the mean and variance.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of one computing environment in which the present invention may be practiced.

FIG. 2 is a block diagram of an alternative computing environment in which the present invention may be practiced.

FIG. 3 is a flow diagram of one embodiment of a method of speaker recognition under the present invention.

FIG. 4 is a more detailed block diagram of a speaker recognition system of one embodiment of the present invention.

FIG. 5 is a more detailed block diagram of a noise matching component under one embodiment of the present invention.

FIG. 6 is a graph of a speech signal.

FIG. 7 is a flow diagram of a method of matching variances for a frequency component under one embodiment of the present invention.

FIG. 8 is a graph of the strength of a frequency component as a function of time.

FIG. 9 is a graph of the strength of a frequency component as a function of time for a noise segment showing the mean of the strength.

FIG. 10 is a graph of the strength of the frequency component of FIG. 9 after subtracting the mean.

FIG. 11 is a graph of the strength of the frequency component of FIG. 10 after multiplying by a gain factor.

FIG. 12 is a graph of the strength of the frequency component for a segment of the test signal or training signal.

FIG. 13 is a graph of the segment of FIG. 12 after subtracting the variance pattern of FIG. 11.

FIG. 14 is a graph of the segment of FIG. 13 after adding the absolute value of the most negative value to all values of the frequency component.

FIG. 15 is a flow diagram of a method of matching means under one embodiment of the present invention.

FIG. 16 is a graph of the strength of a frequency component of one of the test signal or training signal.

FIG. 17 shows the graph of FIG. 16 after adding the difference in means.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENTS

FIG. 1 illustrates an example of a suitable computing system environment 100 on which the invention may be implemented. The computing system environment 100 is only one example of a suitable computing environment and is not intended to suggest any limitation as to the scope of use or functionality of the invention. Neither should the computing environment 100 be interpreted as having any dependency or requirement relating to any one or combination of components illustrated in the exemplary operating environment 100.

The invention is operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well known computing systems, environments, and/or configurations that may be suitable for use with the invention include, but are not limited to, personal computers, server computers, hand-held or laptop devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. In addition, the invention may be used in a telephony system.

The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the invention includes a general-purpose computing device in the form of a computer 110. Components of computer 110 may include, but are not limited to, a processing unit 120, a system memory 130, and a system bus 121 that couples various system components including the system memory to the processing unit 120. The system bus 121 may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, and a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA (EISA) bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus also known as Mezzanine bus.

Computer 110 typically includes a variety of computer readable media. Computer readable media can be any available media that can be accessed by computer 110 and includes both volatile and nonvolatile media, removable and non-removable media. By way of example, and not limitation, computer readable media may comprise computer storage media and communication media. Computer storage media includes both volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by computer 110. Communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media. The term “modulated data signal” means a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired media such as a wired network or direct-wired connection, and wireless media such as acoustic, RF, infrared and other wireless media. Combinations of any of the above should also be included within the scope of computer readable media.

The system memory 130 includes computer storage media in the form of volatile and/or nonvolatile memory such as read only memory (ROM) 131 and random access memory (RAM) 132. A basic input/output system 133 (BIOS), containing the basic routines that help to transfer information between elements within computer 110, such as during start-up, is typically stored in ROM 131. RAM 132 typically contains data and/or program modules that are immediately accessible to and/or presently being operated on by processing unit 120. By way of example, and not limitation, FIG. 1 illustrates operating system 134, application programs 135, other program modules 136, and program data 137.

The computer 110 may also include other removable/non-removable volatile/nonvolatile computer storage media. By way of example only, FIG. 1 illustrates a hard disk drive 141 that reads from or writes to non-removable, nonvolatile magnetic media, a magnetic disk drive 151 that reads from or writes to a removable, nonvolatile magnetic disk 152, and an optical disk drive 155 that reads from or writes to a removable, nonvolatile optical disk 156 such as a CD ROM or other optical media. Other removable/non-removable, volatile/nonvolatile computer storage media that can be used in the exemplary operating environment include, but are not limited to, magnetic tape cassettes, flash memory cards, digital versatile disks, digital video tape, solid state RAM, solid state ROM, and the like. The hard disk drive 141 is typically connected to the system bus 121 through a non-removable memory interface such as interface 140, and magnetic disk drive 151 and optical disk drive 155 are typically connected to the system bus 121 by a removable memory interface, such as interface 150.

The drives and their associated computer storage media discussed above and illustrated in FIG. 1 provide storage of computer readable instructions, data structures, program modules and other data for the computer 110. In FIG. 1, for example, hard disk drive 141 is illustrated as storing operating system 144, application programs 145, other program modules 146, and program data 147. Note that these components can either be the same as or different from operating system 134, application programs 135, other program modules 136, and program data 137. Operating system 144, application programs 145, other program modules 146, and program data 147 are given different numbers here to illustrate that, at a minimum, they are different copies.

A user may enter commands and information into the computer 110 through input devices such as a keyboard 162, a microphone 163, and a pointing device 161, such as a mouse, trackball or touch pad. Other input devices (not shown) may include a joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 120 through a user input interface 160 that is coupled to the system bus, but may be connected by other interface and bus structures, such as a parallel port, game port or a universal serial bus (USB). A monitor 191 or other type of display device is also connected to the system bus 121 via an interface, such as a video interface 190. In addition to the monitor, computers may also include other peripheral output devices such as speakers 197 and printer 196, which may be connected through an output peripheral interface 190.

The computer 110 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 180. The remote computer 180 may be a personal computer, a hand-held device, a server, a router, a network PC, a peer device or other common network node, and typically includes many or all of the elements described above relative to the computer 110. The logical connections depicted in FIG. 1 include a local area network (LAN) 171 and a wide area network (WAN) 173, but may also include other networks. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the computer 110 is connected to the LAN 171 through a network interface or adapter 170. When used in a WAN networking environment, the computer 110 typically includes a modem 172 or other means for establishing communications over the WAN 173, such as the Internet. The modem 172, which may be internal or external, may be connected to the system bus 121 via the user input interface 160, or other appropriate mechanism. In a networked environment, program modules depicted relative to the computer 110, or portions thereof, may be stored in the remote memory storage device. By way of example, and not limitation, FIG. 1 illustrates remote application programs 185 as residing on remote computer 180. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

FIG. 2 is a block diagram of a mobile device 200, which is an exemplary computing environment. Mobile device 200 includes a microprocessor 202, memory 204, input/output (I/O) components 206, and a communication interface 208 for communicating with remote computers or other mobile devices. In one embodiment, the afore-mentioned components are coupled for communication with one another over a suitable bus 210.

Memory 204 is implemented as non-volatile electronic memory such as random access memory (RAM) with a battery back-up module (not shown) such that information stored in memory 204 is not lost when the general power to mobile device 200 is shut down. A portion of memory 204 is preferably allocated as addressable memory for program execution, while another portion of memory 204 is preferably used for storage, such as to simulate storage on a disk drive.

Memory 204 includes an operating system 212, application programs 214 as well as an object store 216. During operation, operating system 212 is preferably executed by processor 202 from memory 204. Operating system 212, in one preferred embodiment, is a WINDOWS® CE brand operating system commercially available from Microsoft Corporation. Operating system 212 is preferably designed for mobile devices, and implements database features that can be utilized by applications 214 through a set of exposed application programming interfaces and methods. The objects in object store 216 are maintained by applications 214 and operating system 212, at least partially in response to calls to the exposed application programming interfaces and methods.

Communication interface 208 represents numerous devices and technologies that allow mobile device 200 to send and receive information. The devices include wired and wireless modems, satellite receivers and broadcast tuners to name a few. Mobile device 200 can also be directly connected to a computer to exchange data therewith. In such cases, communication interface 208 can be an infrared transceiver or a serial or parallel communication connection, all of which are capable of transmitting streaming information.

Input/output components 206 include a variety of input devices such as a touch-sensitive screen, buttons, rollers, and a microphone as well as a variety of output devices including an audio generator, a vibrating device, and a display. The devices listed above are by way of example and need not all be present on mobile device 200. In addition, other input/output devices may be attached to or found with mobile device 200 within the scope of the present invention.

Under the present invention, an apparatus and method are provided that improve the matching of noise between training data and test data. FIG. 3 shows one embodiment of a method for performing such matching.

In step 300 of FIG. 3, a speaker recognition system 398, shown in FIG. 4, receives and stores a training signal. The training signal is received through a microphone 404, which detects a speech signal produced by speaker 400 and additive noise 402. Typically, the training signal is generated by the speaker reading a short identification text such as “Log me in”. Microphone 404 converts the acoustic signal produced by speaker 400 and additive noise 402 into an electrical analog signal that is provided to an analog-to-digital converter 406. Analog-to-digital converter 406 samples the analog signal to produce digital values that are stored in a training data storage 407. Although the training data is shown as being stored as time domain values, those skilled in the art will recognize that the training data may be stored in the frequency domain or as a collection of feature vectors.

At step 302 of FIG. 3, speaker recognition system 398 receives a test signal from speaker 400 along with different additive noise 402. This test signal is typically generated by repeating the identification phrase that was used to generate the training data. Like the training signal, the test signal passes through a microphone and an analog-to-digital converter. Although only one microphone 404 and analog-to-digital converter 406 are shown in FIG. 4, those skilled in the art will recognize that a different microphone and analog-to-digital converter can be used during testing than was used during training. The digital values produced by analog-to-digital converter 406 are then provided to a noise matching component 408, which also receives the training data that is stored in training data storage 407.

At step 304, noise matching component 408 identifies and stores the spectrum of selected samples found in the training signal and the test signal. The elements for performing this identification are shown in more detail in FIG. 5.

In FIG. 5, the test data and training data are each provided to a respective frame construction unit 500, 501, which divide the respective signals into frames, each typically 25 milliseconds long and each starting 10 milliseconds after the start of the previous frame. Each frame is multiplied by a respective window 502, 503, which is typically a Hamming window or a Hanning window. The resulting windows of speech are provided to respective noise identification units 504 and 505.
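By way of illustration only, the framing and windowing described above might be sketched as follows. The function names `frame_signal`, `hamming_window` and `apply_window` are hypothetical and do not appear in the figures; the defaults of 25 milliseconds and 10 milliseconds follow the typical values stated above, and discarding a trailing partial frame is an assumption of this sketch.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25.0, step_ms=10.0):
    """Split a list of samples into overlapping frames.

    Each frame is frame_ms long and starts step_ms after the previous one.
    A trailing partial frame is dropped (an assumption of this sketch).
    """
    frame_len = int(round(sample_rate * frame_ms / 1000.0))
    step = int(round(sample_rate * step_ms / 1000.0))
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += step
    return frames


def hamming_window(n):
    """Return the n coefficients of a standard Hamming window."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def apply_window(frame, window):
    """Multiply a frame by a window, sample by sample."""
    return [s * w for s, w in zip(frame, window)]
```

With a sample rate of 16,000 samples per second, for example, each frame would hold 400 samples and successive frames would start 160 samples apart.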

Noise identification units 504 and 505 identify which frames contain only noise and which frames contain a combination of noise and speech. As can be seen in FIG. 6, a speech signal contains active speech regions and non-active speech regions. In FIG. 6, time is shown along horizontal axis 600 and the amplitude of the speech signal is shown along vertical axis 602. The speech signal of FIG. 6 includes one active region 604 that contains both noise and speech and two non-active regions 606 and 608 that contain only noise and represent periods where the speaker has paused.

Noise identification units 504 and 505 can use any of a number of known techniques to classify the frames as speech or noise. As is known in the art, these techniques can operate on the windowed speech signal directly or on transformations of the speech signal such as Fast Fourier Transform values or mel-cepstrum features.
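The description does not specify which classification technique is used. Purely as an assumed stand-in, one simple possibility that operates on the windowed signal directly is an average-energy threshold, sketched below. The function name `classify_frames` and the `energy_threshold` parameter are hypothetical, and a practical system would likely use a more robust detector.

```python
def classify_frames(frames, energy_threshold):
    """Label each non-empty frame as "speech" or "noise".

    A frame whose mean squared sample value exceeds energy_threshold is
    treated as containing speech; otherwise it is treated as noise only.
    This rule is an illustrative assumption, not the method of the text.
    """
    labels = []
    for frame in frames:
        energy = sum(s * s for s in frame) / len(frame)
        labels.append("speech" if energy > energy_threshold else "noise")
    return labels
```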

When noise identification units 504 and 505 of FIG. 5 classify a frame as a noise frame, they pass the noise frame through a respective Fast Fourier Transform (FFT) 506, 508. Each FFT 506, 508 converts the values in the noise frame into a collection of frequency values representing the spectrum of the signal in the frame. These frequency values represent the relative strength of each frequency component in the spectrum of the signal and can be amplitude values or energy values. Because the FFTs produce complex numbers for the frequency values, energy is calculated as the square of the real value added to the square of the imaginary value. The amplitude is simply the square root of the energy. (Note that in the present application, and specifically in the claims, a reference to a strength value for a frequency component can be interpreted as either an amplitude value or an energy value.)
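The energy and amplitude calculations described above can be sketched as follows. To keep the example self-contained, a direct discrete Fourier transform is used in place of a fast implementation; it yields the same frequency values, only more slowly. The names `dft` and `strength_values` are hypothetical.

```python
import cmath
import math


def dft(frame):
    """Direct discrete Fourier transform (same values an FFT would give)."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def strength_values(frame, use_energy=True):
    """Return one strength value per frequency component of a frame.

    Energy is real^2 + imag^2; amplitude is the square root of the energy.
    """
    strengths = []
    for value in dft(frame):
        energy = value.real ** 2 + value.imag ** 2
        strengths.append(energy if use_energy else math.sqrt(energy))
    return strengths
```

For a real-valued frame of n samples, only the first n/2 + 1 components are distinct; the sketch returns all n for simplicity.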

The strength values are stored in a noise storage 510. As is shown in FIG. 5, noise storage 510 is divided into two sections, a training storage 512, which contains noise frames from the training speech, and a test storage 514, which contains noise frames from the test speech.

Once the spectra of the noise frames for the training signal and test signal have been stored at step 304 of FIG. 3, the process of FIG. 3 continues at step 306. In step 306, the means and variances of a plurality of frequency components in the noise of the training signal and in the noise of the test signal are adjusted so that the means and variances are the same in the noise of both signals. This is performed by a spectral adder 516, which accesses the noise segments stored in noise storage 510. The technique for adjusting the means and variances of the noise is discussed further below in connection with FIG. 7.

Once the variances and the means of each frequency component of the noise have been matched, the matched training signal is output by spectral adder 516 to a feature extractor 410 of FIG. 4. Feature extractor 410 extracts one or more features from the training signal. Examples of possible feature extraction modules that can be used under the present invention include modules for performing linear predictive coding (LPC), LPC direct cepstrum, perceptive linear prediction (PLP), auditory model feature extraction, and Mel-frequency cepstrum coefficients feature extraction. Note that the invention is not limited to these feature extraction modules and that other modules may be used within the context of the present invention.

Using the features extracted by feature extractor 410, the method of FIG. 3 continues at step 310 by training a model based on the features extracted from the matched training signal. This training is performed by a trainer 424 of FIG. 4 based on the training identification phrase 426.

Using the extracted features and the training phrase, trainer 424 builds an acoustic model 418. In one embodiment, acoustic model 418 is a Hidden Markov Model (HMM). However, other models may be used under the present invention including segment models. Typically, feature vectors can be evaluated against the model, giving a probability that each feature vector was spoken by the same speaker who trained the model. Some models are dependent on what is spoken (so-called text-dependent models); other types of models (text-independent) simply evaluate whether any sequence of sounds came from the same speaker who trained the model.

Once the acoustic model has been trained, spectral adder 516 provides the matched test signal to feature extractor 410, which extracts the same type of features from the matched test signal that were extracted from the matched training signal.

At step 312 of FIG. 3, the extracted features from the matched test signal are applied to acoustic model 418 by a decoder 412. Using acoustic model 418, decoder 412 determines an overall probability that the matched test signal was generated by the same speaker who trained acoustic model 418. This output probability can either be compared to other probabilities generated for other sets of training data or can be compared to a threshold value to determine whether the probability provided by decoder 412 is sufficiently high to identify the speaker as the same person who trained the current model.
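The two ways of using the output probability described above, comparison against other models and comparison against a threshold, can be illustrated with the following sketch. The function `identify_speaker`, the dictionary of per-model probabilities and the way the two tests are combined are assumptions made for illustration; the description presents them as alternatives and does not prescribe a data structure.

```python
def identify_speaker(scores, threshold=None):
    """Pick the model with the highest probability for the test signal.

    scores maps a model (speaker) name to the probability produced for the
    matched test signal. If a threshold is given and the best probability
    falls below it, no speaker is identified and None is returned.
    """
    if not scores:
        return None
    best = max(scores, key=scores.get)
    if threshold is not None and scores[best] < threshold:
        return None
    return best
```

For example, `identify_speaker({"alice": 0.7, "bob": 0.2})` selects `"alice"`, whereas adding `threshold=0.9` rejects the attempt.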

Note that in the method of FIG. 3, the model is trained using matched training data that has had its noise matched to the noise in the matched test data. Also note that the matched test data is applied to the model. Under the present invention, such matching is thought to provide a more accurate probability measure for speaker identification.

Step 306 of FIG. 3, which shows the step of adjusting the variances and means of the noise in the training and test signals, represents a step of spectral addition that is performed in order to match the noise in the training signal to the noise in the test signal. Specifically, this step aims to match the mean strength of each frequency in the noise of the test signal to the mean strength of each frequency in the noise of the training signal and to match the variance in the strength of each frequency component in the noise of these signals.

Under most embodiments of the present invention, the matching is performed by first identifying which signal's noise has the higher mean strength for each frequency component and which signal's noise has the higher variance for each frequency component. The test signal and the training signals are then modified by adding properly adjusted noise segments to each signal so that the mean and variance of each frequency component of the noise in the modified signals are equal to the maximum mean and maximum variance found in the noise of either signal. Under one embodiment, a cross-condition is applied so that the noise segments that are added to the test signal come from the training signal and the noise segments that are added to the training signal come from the test signal.

As an example, let us say that at frequency F1, the training noise has a mean of 5 and a variance of 2, and the testing noise has a mean of 4 and a variance of 3. The following noise will be added to the training signal: test noise at frequency F1 modified such that when added to the training signal, the combined signal will have mean 5 (the greater of the two means) and variance 3 (the greater of the two variances). Thus, the signal to add will have mean 0 and variance 1, since the mean of summed signals is always additive, and the variance of summed linearly independent signals is additive (see Fundamentals of Applied Probability Theory, Alvin M. Drake, McGraw-Hill Book Company, 1988, p 108). In order to make the test noise segment have these characteristics, the noise segment is shifted and scaled as discussed further below.

Similarly, the noise segment added to the test signal will be a training noise segment that has been scaled and shifted to have a mean of 1 and a variance of zero. When added to the test signal, the noise segment will cause the modified test signal to have mean 5 and variance 3 just like the modified training signal. As will be shown below, this technique of always selecting the signal with the higher mean or higher variance as the signal to match to eliminates the need for flooring that causes spectral subtraction to be less than ideal.
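The arithmetic of the F1 example can be checked with a short sketch. It relies on the two facts stated above: means of summed signals add, and variances of summed linearly independent signals add. The function name `spectral_addition_targets` and the layout of its return value are hypothetical.

```python
def spectral_addition_targets(train_mean, train_var, test_mean, test_var):
    """Compute, for one frequency component, the mean and variance that the
    added noise must contribute to each signal.

    The targets are the larger of the two means and the larger of the two
    variances, so every required contribution is zero or positive and no
    subtraction (and hence no flooring) is needed.
    """
    target_mean = max(train_mean, test_mean)
    target_var = max(train_var, test_var)
    return {
        "target": (target_mean, target_var),
        "add_to_training": (target_mean - train_mean, target_var - train_var),
        "add_to_test": (target_mean - test_mean, target_var - test_var),
    }
```

Called with the values of the example, `spectral_addition_targets(5, 2, 4, 3)` gives a target of `(5, 3)`, a contribution of `(0, 1)` for the training signal and `(1, 0)` for the test signal, in agreement with the text.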

The means and variances of the noise may be adjusted independently by adding two different respective signals to both the test speech signal and training speech signal or at the same time by adding one respective signal to both the test speech signal and the training speech signal. In embodiments where two signals are used, the mean may be adjusted before the variance or after the variance. In addition, the means and variances do not have to both be adjusted; one may be adjusted without adjusting the other. In the discussion below, the embodiment in which two different signals are applied to both the test signal and the training signal is described. In this embodiment, signals to match the variances of the noise are first added to the speech signals and then signals to match the means of the noise are added to the speech signals.

The steps for adjusting the variance for a single frequency component of the noise are shown in FIG. 7. The method of FIG. 7 begins at step 700 where the variance of the noise in the training signal is determined. To determine the variance of a particular frequency component in the noise of the training signal, the method tracks strength values (i.e. amplitude values or energy values) of this frequency component in different noise segments stored in noise storage 510 of FIG. 5. Methods for determining the variance of such values are well known.

An example of how a frequency component's strength values change over time is shown in graph 804 of FIG. 8, where time is shown along horizontal axis 800, and the strength of the frequency component is shown along vertical axis 802. Note that although graph 804 is depicted as a continuous graph, the strength values can be either discrete or continuous under the present invention.

To calculate the complete variance in the noise of the training signal, the strength of the frequency component is measured at each noise frame in the entire training corpus. For example, if the user repeated the identification phrase three times during training, the variance in the noise would be determined by looking at all of the noise frames found in the three repetitions of the training phrase.
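A minimal sketch of this calculation follows, assuming each noise frame is held as a list of strength values indexed by frequency component. The names `pool_noise_frames` and `component_stats` are hypothetical, and the use of the population variance (dividing by the number of frames) rather than the sample variance is an assumption, since the text says only that well-known methods are used.

```python
from statistics import fmean, pvariance


def pool_noise_frames(repetitions):
    """Merge the noise frames of several repetitions into one list."""
    return [frame for repetition in repetitions for frame in repetition]


def component_stats(noise_frames, k):
    """Return (mean, variance) of frequency component k over all noise frames.

    noise_frames is a list of frames, each a list of strength values.
    The population variance is used (an assumption of this sketch).
    """
    values = [frame[k] for frame in noise_frames]
    return fmean(values), pvariance(values)
```

With three repetitions of the training phrase, `component_stats(pool_noise_frames([rep1, rep2, rep3]), k)` would use every noise frame in the training corpus, as described above.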

After the variance of the frequency component in the noise of the training signal has been determined, the method of FIG. 7 continues at step 702 where the variance of the frequency component in the noise of the test signal is determined. The variance of the frequency component in the noise of the test signal is determined using the same techniques discussed above for the training signal.

Once the variances of the frequency component in the noise have been determined for the training signal and the test signal, the present invention determines which signal has the greater variance in the noise. It then adds a noise segment to the other signal, increasing the variance of the frequency component in the signal that has the lesser variance in the noise so that its variance in the noise matches that of the other signal. For example, if the variance of the frequency component in the noise of the training signal were less than the variance of the frequency component in the noise of the test signal, a modified noise segment from the test signal would be added to the training signal so that the variance in the noise in the training signal matches the variance in the noise in the test signal.
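A minimal sketch of this comparison for one frequency component is given below. The function name `choose_receiver` and the string labels are assumptions made for illustration; the handling of exactly equal variances is also an assumption, since the text does not address that case.

```python
def choose_receiver(var_train, var_test):
    """Decide which signal has noise added for one frequency component.

    Returns a (receiver, donor) pair: the receiver is the signal with the
    lesser noise variance, and the donor is the signal whose noise segment
    is modified and added to the receiver.
    """
    if var_train < var_test:
        return ("training", "test")
    if var_test < var_train:
        return ("test", "training")
    # Equal variances: nothing needs to be added (assumed behaviour).
    return (None, None)


print(choose_receiver(1.5, 4.0))
```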

Under one embodiment, the noise segments are not added directly to the signals to change their variance. Instead, the mean strength of the frequency component is set to zero across the noise segment and the variance of the noise segment is scaled. These changes limit the size of the strength values that are added to the test signal or training signal so that the variances in the noise in the test signal and training signal match but the mean strength in the two signals is not increased any more than is necessary. The process of selecting a noise segment, setting the mean of the noise segment's frequency component to zero, and scaling the variance of the noise segment's frequency component is shown as steps 704, 706, 708 and 710 in FIG. 7.

First, at step 704, a noise segment is selected from the testing signal to be added to the training signal and from the training signal to be added to the test signal. These noise segments typically include a plurality of frames of the training signal or testing signal and can be taken from noise storage 510 of FIG. 5. Specifically, the strength values for the current frequency component are retrieved.

An example of how the frequency component's strength for such a selected noise segment changes over time is shown as graph 904 in FIG. 9. In FIG. 9, time is shown along horizontal axis 900 and the strength of the frequency component is shown along vertical axis 902. Although graph 904 shows the frequency component as continuous, the frequency component may be continuous or discrete under the present invention.

After the noise segment has been selected, the mean of the strength of the frequency component in the noise segment is determined at step 706. In FIG. 9, the mean is shown as horizontal line 906.

In step 708 of FIG. 7, the mean determined in step 706 is subtracted from each of the strength values of the frequency component across the entire noise segment. For an embodiment where the strength values are continuous, this involves subtracting line 906 of FIG. 9 from graph 904 of FIG. 9. This subtraction results in a set of modified strength values for the frequency component of the noise segment. A graph 1004 of such modified strength values is shown in FIG. 10, where time is shown along horizontal axis 1000 and strength is shown along vertical axis 1002.

The mean strength of the frequency component in the noise segment is subtracted from the frequency component's strength values in order to generate a set of strength values that have zero mean but still maintain the variance found in the original noise segment. Thus, in FIG. 10, the strength of the frequency component continues to vary as it did in the original noise segment; however, its mean has now been adjusted to zero.
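Steps 706 and 708 can be sketched for a discrete series of strength values as shown below. The function name `zero_mean` and the plain-list representation of a noise segment are assumptions made for illustration.

```python
from statistics import fmean, pvariance


def zero_mean(segment):
    """Subtract the segment's mean from every strength value.

    `segment` is a list of strength values for one frequency component in
    the selected noise segment (a hypothetical representation).
    """
    mean = fmean(segment)                      # step 706: find the mean
    return [value - mean for value in segment]  # step 708: subtract it


segment = [2.0, 4.0, 6.0]
centered = zero_mean(segment)
print(centered)
# Subtracting a constant leaves the spread of the values unchanged.
print(pvariance(segment), pvariance(centered))
```

The second print illustrates the point made in the text: the zero-mean values keep the variance of the original noise segment.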

In step 710, once the values of the frequency component's strength have been adjusted so that they have zero mean, the values are scaled so that they provide a proper amount of variance. This scaling is performed by multiplying each of the strength values by a variance gain factor. The variance gain factor, G, is determined by the following equation:

$$G = \frac{\sigma_{TRAIN}^{2} - \sigma_{TEST}^{2}}{\sigma_{NOISE}^{2}} \qquad \text{(Eq. 1)}$$

where G is the variance gain factor, $\sigma_{TRAIN}^{2}$ is the variance in the noise of the training signal, $\sigma_{TEST}^{2}$ is the variance in the noise of the test signal, and $\sigma_{NOISE}^{2}$ is the variance of the values in the zero-mean noise segment produced at step 708.
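The following sketch evaluates Eq. 1 exactly as written and applies the resulting gain to a zero-mean segment. The function names, the list representation, the population variance, and the error raised for a constant segment are assumptions. Note also that multiplying strength values by a factor G multiplies their variance by G squared, so an implementation would need to decide whether the ratio of Eq. 1 is applied to the values directly, as here, or to their variance; the text does not settle this.

```python
from statistics import pvariance


def variance_gain(var_train, var_test, zero_mean_segment):
    """Variance gain factor G of Eq. 1 for one frequency component."""
    var_noise = pvariance(zero_mean_segment)
    if var_noise == 0:
        # A constant segment carries no variance to scale (assumed handling).
        raise ValueError("noise segment has zero variance")
    return (var_train - var_test) / var_noise


def scale_segment(zero_mean_segment, gain):
    """Step 710: multiply each zero-mean strength value by the gain."""
    return [gain * value for value in zero_mean_segment]


segment = [-1.0, 1.0]                 # zero-mean values from step 708
gain = variance_gain(5.0, 3.0, segment)
print(gain, scale_segment(segment, gain))
```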

The result of multiplying the strength values by the gain factor of Eq. 1 is shown in graph 1104 of FIG. 11, where time is shown along horizontal axis 1100 and strength is shown along vertical axis 1102. The graph of the frequency component values has the same general shape as graph 1004 of FIG. 10 but is simply magnified or compressed.

After step 710, the modified frequency component values of the noise segment have zero mean and a variance that is equal to the difference between the variance of the training signal and the variance of the test signal. Thus, the modified values can be thought of as a variance pattern. When added to the signal with the lesser variance in the noise, the strength values of this variance pattern cause the signal with the lesser variance in the noise to have a new variance in the noise that matches the variance in the noise of the signal with the larger variance in the noise. For example, if the test signal had a lower variance in its noise than the training signal, adding the variance pattern from the training noise segment to each of a set of equally sized segments in the test signal would generate a test signal with a variance due to noise that matches the higher variance in the noise of the training signal. The step of adding the variance pattern to the strength values of the test signal or training signal is shown as step 712.
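Step 712 can be sketched as a repeated, element-wise addition of the variance pattern over consecutive segments of the receiving signal. The function name `add_variance_pattern` is hypothetical, and wrapping the pattern over a final partial segment is an assumption, since the text only speaks of equally sized segments.

```python
def add_variance_pattern(signal, pattern):
    """Add the variance pattern to each pattern-sized segment of a signal.

    `signal` holds the strength values of one frequency component over the
    whole test or training signal; `pattern` is the scaled zero-mean noise
    segment for that component.
    """
    if not pattern:
        raise ValueError("variance pattern must not be empty")
    size = len(pattern)
    # Index i falls at offset i % size inside its segment, so the pattern
    # repeats once per segment.
    return [value + pattern[i % size] for i, value in enumerate(signal)]


print(add_variance_pattern([1.0, 2.0, 3.0, 4.0], [1.0, -1.0]))
```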

Note that for the signal with the higher variance in the noise, the variance gain factor is set to zero. When multiplied by the strength values of the noise segment, this causes the modified noise segment to have a mean of zero and a variance of zero.

Note that because of the subtraction performed in step 708, the test signal or training signal produced after step 712 may have a negative strength for one or more frequency components. For example, FIG. 12 shows strength values for the frequency component of either the test signal or training signal, with time shown along horizontal axis 1200 and strength shown along vertical axis 1202. Since the strength values in FIG. 12 are taken from an actual test signal or training signal, all of the strength values in graph 1204 are positive. However, FIG. 13 shows the result of the addition performed in step 712, where the strength values in segments of the test signal are added to respective strength values of the variance pattern shown in FIG. 11. In FIG. 13, time is shown along horizontal axis 1300 and strength is shown along vertical axis 1302. Graph 1304 of FIG. 13 represents the addition of graph 1104 of FIG. 11 with graph 1204 of FIG. 12. As shown in FIG. 13, graph 1304 includes negative values for some strengths of the frequency component because the variance pattern included some negative values after the mean of the noise segment was subtracted in step 708.

Since a negative strength (either amplitude or energy) for a frequency component cannot be realized in a real system, the strength values for the frequency component in the test signal and training signal must be increased so that all of the values are greater than or equal to zero. In addition, the strength values must be increased uniformly so that the variance of the noise in the two signals is unaffected.

To do this, one embodiment of the present invention searches for the most negative value in the entire signal that had its variance increased. This minimum value is shown as minimum 1306 in FIG. 13. Once the minimum value has been identified, its absolute value is added to each of the strength values for the frequency component across the entire test signal and the entire training signal. This is shown as step 716 in FIG. 7.

FIG. 14 provides a graph 1404 of the signal of FIG. 13 after this addition, showing that the strength for the frequency component now has a minimum of zero. In FIG. 14, time is shown along horizontal axis 1400 and strength is shown along vertical axis 1402. Since the same strength value is added to each of the strength values, the variances of the noise in the test signal and training signal are unchanged.

Note that the strength value must be added to both the test signal and the training signal regardless of which signal had its variance increased. If this were not done, the mean of the noise in one of the signals would increase while the mean of the noise in the other signal would remain the same. This would cause the means of the noise to become mismatched.
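The uniform shift of step 716 can be sketched as follows. The function name `remove_negative_strengths` and the argument order are assumptions; `modified` stands for the signal that received the variance pattern and `other` for the signal that did not.

```python
from statistics import pvariance


def remove_negative_strengths(modified, other):
    """Shift both signals up by the magnitude of the most negative value.

    The search covers only the signal whose variance was increased, as in
    the text, but the same offset is added to both signals so that their
    noise means move together.
    """
    lowest = min(modified)
    offset = -lowest if lowest < 0 else 0.0
    shifted_modified = [value + offset for value in modified]
    shifted_other = [value + offset for value in other]
    return shifted_modified, shifted_other


test_signal = [1.0, -2.0, 3.0]       # illustrative values after step 712
training_signal = [4.0, 5.0, 6.0]
new_test, new_training = remove_negative_strengths(test_signal, training_signal)
print(new_test, new_training)
# A constant offset leaves the variance unchanged.
print(pvariance(test_signal), pvariance(new_test))
```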

In FIG. 7, the step of adjusting the modified test signal and training signal to avoid having negative values in those signals has been shown as occurring before the means of the noise of the two signals have been matched. In other embodiments, this step is performed after the means of the noise have been matched. One benefit of waiting to adjust the signals for negative values until after the means of the noise have been matched is that the step of matching the means of the noise may cause the signals to be increased to the point where they do not include any negative values.

After step 716, the variances of the noise of the test signal and the training signal are matched and each signal has only non-negative strength values for each frequency component.

Note that the steps of FIG. 7 are repeated for each desired frequency component in the test signal and training signal. Also note that the variance of the noise for some frequency components will be higher in the test signal than in the training signal, while for other frequency components, the variance of the noise in the test signal will be lower than in the training signal. Thus, at some frequencies, a variance pattern formed from a noise segment will be added to the test signal, while at other frequencies, a variance pattern formed from a noise segment will be added to the training signal.

Once the variances in the noise of the training signal and test signal have been matched, the means of the strength values in the noise of the two signals are matched. This is shown as step 308 in FIG. 3 and is shown in detail in the flow diagram of FIG. 15. As in the method of FIG. 7, the steps for matching a mean strength of the noise shown in FIG. 15 are repeated for each frequency component of interest in the noise of the test signal and training signal. Consistent with the discussion above, the mean strength of the noise can either be the mean amplitude or the mean energy, depending on the particular embodiment.

In step 1500 of FIG. 15, the mean strength of the current frequency component in the noise of the training signal is determined. This mean can be calculated by accessing the strength values stored in noise storage 506 for the noise segments of the training signal. In step 1502, the mean strength for the current frequency component in the noise of the test signal is determined by accessing the strength values found in noise storage 508.

In step 1504 of FIG. 15, the difference between the means in the noise of the test signal and the training signal is determined. This involves simply subtracting the mean strength of the noise of one signal from the mean strength of the noise in the other signal and taking the absolute value of the result.

In step 1506, the signal with the lower mean in the noise has all of its strength values for the frequency component increased by an amount equal to the difference between the means of the noise in the test signal and the noise in the training signal. This can be seen by comparing FIGS. 16 and 17. In FIG. 16, graph 1604 shows the strength of a frequency component of the test signal or training signal as a function of time. In FIG. 16, time is shown along horizontal axis 1600 and strength is shown along vertical axis 1602. FIG. 17 shows the same frequency component for the same signal after the difference between the means in the noise of the test signal and training signal has been added to the signal of FIG. 16. Thus, graph 1704 of FIG. 17 has the same shape as graph 1604 of FIG. 16 but is simply shifted upward. This upward shift does not change the variance in the noise, but simply shifts the mean of the frequency component across the signal. Thus, the variances in the noise continue to be matched after the steps of FIG. 15.
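Steps 1504 and 1506 can be sketched for one frequency component as shown below. The function name `match_noise_means` and its parameters are hypothetical; the noise means are assumed to have been computed already in steps 1500 and 1502.

```python
def match_noise_means(training, test, mean_train_noise, mean_test_noise):
    """Raise the signal whose noise has the lower mean strength.

    `training` and `test` hold the strength values of one frequency
    component over each whole signal. Returns the (training, test) pair
    after the shift.
    """
    # Step 1504: absolute difference between the two noise means.
    difference = abs(mean_train_noise - mean_test_noise)
    # Step 1506: add the difference to every value of the lower signal.
    if mean_train_noise < mean_test_noise:
        training = [value + difference for value in training]
    elif mean_test_noise < mean_train_noise:
        test = [value + difference for value in test]
    return training, test


print(match_noise_means([1.0, 2.0], [5.0, 6.0], 3.0, 5.0))
```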

Note that for some frequency components, the mean of the frequency component in the noise in the test signal is greater than the mean of the frequency component in the noise in the training signal, while at other frequencies the reverse is true. Thus, at some frequencies, the difference between the means of the noise is added to the test signal, while at other frequencies the difference between the means of the noise is added to the training signal.

As mentioned above, in alternative embodiments, only one respective noise signal is added to each of the training signal and test signal in order to match both the variance and means of the noise of those signals. Thus, one noise signal generated from a training noise segment would be added to the test signal and one noise signal generated from a test noise segment would be added to the training signal. Under one embodiment, the one noise signal to be added to each speech signal is formed by adding the difference between the means of the noise to all of the values of the variance pattern of the signal with the lower mean in the noise. The resulting mean adjusted variance pattern is then added to its respective signal as described above.
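The single-signal alternative can be sketched as follows, assuming hypothetical function names and a plain-list representation. The sketch folds the mean difference into the variance pattern and then checks, on illustrative values, that one addition of the combined pattern gives the same result as adding the pattern and the mean difference separately.

```python
def mean_adjusted_pattern(variance_pattern, mean_difference):
    """Add the difference between the noise means to every pattern value."""
    return [value + mean_difference for value in variance_pattern]


def add_pattern(signal, pattern):
    """Add a pattern to each pattern-sized segment of a signal."""
    size = len(pattern)
    return [value + pattern[i % size] for i, value in enumerate(signal)]


signal = [1.0, 2.0, 3.0, 4.0]        # illustrative strength values
pattern = [1.0, -1.0]                # illustrative variance pattern
difference = 2.0                     # illustrative mean difference

one_pass = add_pattern(signal, mean_adjusted_pattern(pattern, difference))
two_pass = [value + difference for value in add_pattern(signal, pattern)]
print(one_pass, one_pass == two_pass)
```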

From the above discussion it can be seen that after the steps of FIG. 7 and FIG. 15 have been performed for each frequency component in the test signal and training signal, the mean of each frequency component in the noise of the training and test signals is the same and the variance of the frequency component in the noise of each signal is the same. This means that the noise in the training data and the noise in the test data are matched.

Multiple training signals can be dealt with in several ways. Two primary ways are discussed here. First, if all the training signals are considered to have been generated in the same noisy environment, they can be treated as one training signal for the above description. Second, if they might have come from separate noisy environments, such as would occur if they were recorded at separate times, the above description would simply be extended to multiple signals. The mean and variance of each frequency of the noise of all signals would be appropriately adjusted (through adding noise from the other conditions) to have the maximum mean and variance at each frequency in the noise of any of the multiple signals.
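For the second case, the per-frequency targets can be sketched as the largest noise mean and the largest noise variance found in any of the signals. The function name `target_statistics` and the list of (mean, variance) pairs are assumptions made for illustration.

```python
def target_statistics(noise_stats):
    """Targets for one frequency component across several signals.

    `noise_stats` is a hypothetical list with one (mean, variance) pair per
    signal, each describing that signal's noise at this frequency.
    Every signal would then be raised toward the returned targets.
    """
    target_mean = max(mean for mean, _ in noise_stats)
    target_variance = max(variance for _, variance in noise_stats)
    return target_mean, target_variance


# Two training signals and one test signal (illustrative values).
print(target_statistics([(1.0, 4.0), (3.0, 2.0), (2.0, 5.0)]))
```

Note that the maximum mean and the maximum variance may come from different signals, as they do in the illustrative values above.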

Although the present invention has been described with reference to particular embodiments, workers skilled in the art will recognize that changes may be made in form and detail without departing from the spirit and scope of the invention.

1. A method of speaker recognition that generates a likelihood that the same speaker generated a training signal and a test signal, the method comprising: generating a matched test signal and a matched training signal by performing steps for each of a plurality of frequency components, the steps comprising: adding to the strength of the frequency component in one of the test signal or training signal so that the mean strength of the frequency component of noise in the matched test signal matches the mean strength of the frequency component of noise in the matched training signal; determining the variance of the frequency component of noise in the training signal; determining the variance of the frequency component of noise in the test signal; and increasing the variance of the frequency component in one of the test signal or the training signal so that the variance of the frequency component in the noise of the matched training signal matches the variance of the frequency component in the noise of the matched test signal; creating a model based on the matched training signal; and applying the matched test signal to the model to produce the likelihood that a same speaker generated the training signal and the test signal.
2. The method of claim 1 wherein performing steps for each of a plurality of frequency components, further comprises the steps of: determining the mean strength of the frequency component of noise in the training signal; determining the mean strength of the frequency component of noise in the test signal, and subtracting the mean strength of noise in the training signal from the mean strength of noise in the test signal to determine a value to add during the step of adding to the strength of the frequency component in one of the test signal or training signal.
3. The method of claim 1 wherein for each frequency component the step of adding to the strength of the frequency component in one of the test signal or training signal comprises adding to the strength of the frequency component in the test signal.
4. The method of claim 1 wherein for each frequency component the step of adding to the strength of the frequency component in one of the test signal or training signal comprises adding to the strength of the frequency component in the training signal.
5. The method of claim 1 wherein for some frequency components the step of adding to the strength of the frequency component in one of the test signal or training signal comprises adding to the strength of the frequency component in the training signal and for other frequency components the step of adding to the strength of the frequency component in one of the test signal or training signal comprises adding to the strength of the frequency component in the test signal.
6. The method of claim 1 wherein adding to the strength of the frequency component in one of the test signal or training signal does not change the variances of the frequency component in the test signal and the training signal.
 7. (canceled)
8. The method of claim 1 wherein increasing the variance of the frequency component in one of the test signal or the training signal comprises: deriving a variance pattern from a noise segment taken from one of the test signal or training signal; and adding the variance pattern to all segments of one of the test signal or the training signal.
9. The method of claim 8 wherein deriving the variance pattern of the noise segment comprises: determining the mean of the noise segment; subtracting the mean of the noise segment from the noise segment to produce a zero-mean noise segment; and multiplying the zero-mean noise segment by a gain factor to produce the variance pattern.
10. The method of claim 9 wherein adding the variance pattern to all segments of one of the test signal or training signal further comprises: after adding the variance pattern determining the most negative value for the frequency component in one of the test signal or the training signal; and adding a value equal to the magnitude of the most negative value to the frequency component of both the test signal and the training signal.
11. The method of claim 9 wherein adding the variance pattern to all segments of one of the test signal or training signal further comprises: after adding the variance pattern and adding to the strength of the frequency component in one of the test signal or training signal, determining the most negative value for the frequency component in one of the test signal or the training signal; and adding a value equal to the magnitude of the most negative value to the frequency component of both the test signal and the training signal.
12. The method of claim 8 wherein adding to the strength of the frequency component in one of the test signal or training signal comprises adding to the strength of the variance pattern before adding the variance pattern to all segments of one of the test signal or training signal.
13. The method of claim 8 wherein deriving a variance pattern comprises deriving a variance pattern from a noise segment taken from the test signal and wherein adding the variance pattern comprises adding the variance pattern to all segments of the training signal.
14. The method of claim 8 wherein deriving a variance pattern comprises deriving a variance pattern from a noise segment taken from the training signal and wherein adding the variance pattern comprises adding the variance pattern to all segments of the test signal.
15. A method of identifying a speaker comprising: receiving a training speech signal; receiving a test speech signal; for each of a plurality of frequency components adding to the variance of the frequency component in one of the training speech signal or test speech signal so that the variance of the frequency component of noise in a matched training speech signal matches the variance of the frequency component of noise in matched test speech signal, wherein adding to the variance of the frequency component comprises: identifying a series of strength values for the frequency component in a segment of noise taken from one of the training speech signal or the test speech signal; finding the mean of the series of strength values; subtracting the mean from each strength value in the series of strength values to generate zero-mean strength values; multiplying the zero-mean strength values by a gain factor to produce a variance pattern; and adding the variance pattern to each segment of one of the training speech signal or the test speech signal; generating a model from the matched training speech signal; and comparing the matched test speech signal to the model to identify the speaker.
 16. (canceled)
17. The method of claim 15 wherein after adding the variance pattern the method further comprises: determining the most negative value for the strength of the frequency component in the one of the training speech signal or test speech signal to which the variance pattern was added; and adding the absolute value of the most negative value to the strength of the frequency component over the entire training speech signal and the entire test speech signal.
18. The method of claim 15 further comprising for each of a plurality of frequency components adding to the frequency component of one of the test speech signal or the training speech signal so that the mean strength of the frequency component of the noise in the test speech signal is matched to the mean strength of the frequency component of the noise in the training speech signal.
19. The method of claim 15 further comprising before adding the variance pattern to each segment of one of the training speech signal or test speech signal: adding a same value to each strength value of the variance pattern so that the mean strength of the frequency component of the noise in the matched test speech signal is matched to the mean strength of the frequency component of the noise in the matched training speech signal when the variance pattern is added to each segment of one of the training speech signal or the test speech signal.
 20. (canceled)
21. A computer-readable medium having computer-executable instructions for performing speaker recognition, the instructions performing steps comprising: receiving a training speech signal; receiving a test speech signal; adding to the strength of at least one frequency component across the entirety of one of the training speech signal or test speech signal in the production of a matched training speech signal and a matched test speech signal such that the mean strength of the frequency component in noise in the matched training speech signal is the same as the mean strength of the frequency component in noise in the matched test speech signal; selectively adding to the strength of the frequency component in one of the training speech signal or the test speech signal in further production of the matched training speech signal and the matched test speech signal such that the variance of the strength of the frequency component of noise in the matched training speech signal is equal to the variance of the strength of the frequency component of noise in the matched test speech signal; generating a model from the matched training speech signal; and comparing the matched test speech signal to the model to identify a speaker.
22. The computer-readable medium of claim 21 wherein adding to the strength of a frequency component comprises: determining the mean strength of the frequency component in noise in the training speech signal; determining the mean strength of the frequency component in noise in the test speech signal; determining the difference between the mean strength in noise in the training speech signal and the mean strength in noise in the test speech signal; adding the difference to the strength of the frequency component in one of the training speech signal or test speech signal.
 23. (canceled)
24. The computer-readable medium of claim 21 wherein selectively adding to the strength of the frequency component comprises: selecting a noise segment from one of the training speech signal or the test speech signal; identifying strength values of the frequency component in the noise segment; determining the mean of the strength values; subtracting the mean of the strength values from the strength values to produce a sequence of mean adjusted strength values; multiplying the mean adjusted strength values by a gain factor to produce gain adjusted strength values; adding the gain adjusted strength values to respective strength values of the frequency component in each of a plurality of segments that together constitute one of the training speech signal or test speech signal.
25. The computer-readable medium of claim 24 wherein adding to the strength of at least one frequency component across the entirety of one of the training speech signal or test speech signal comprises adding the same value to all of the gain adjusted strength values before adding the gain adjusted strength values to the respective strength values.
26. The computer-readable medium of claim 25 wherein selectively adding to the strength of the frequency component further comprises: identifying the most negative value produced by adding the gain adjusted strength values to the respective strength values of the frequency component in each of a plurality of segments that constitute one of the training speech signal and test speech signal; and adding a value equal to the absolute magnitude of the most negative value to each strength value of the frequency component in both the training speech signal and the test speech signal.
27. The computer-readable medium of claim 24 wherein the computer-executable instructions perform further steps comprising: determining the variance of strength values of the frequency component in the noise of the test speech signal; determining the variance of strength values of the frequency component in the noise of the training speech signal; determining the variance of the strength values of the frequency component in the noise segment; and determining the gain factor by subtracting the variance of the strength values of the frequency component in the noise of the test speech signal from the variance of the strength values of the frequency component in the noise of the training speech signal and dividing the difference by the variance of the strength values of the frequency component in the noise segment.
28. The computer-readable medium of claim 24 wherein selectively adding to the strength of the frequency component further comprises: identifying the most negative value produced by adding the gain adjusted strength values to the respective strength values of the frequency component in each of a plurality of segments that constitute one of the training speech signal and test speech signal; and adding a value equal to the absolute magnitude of the most negative value to each strength value of the frequency component in both the training speech signal and the test speech signal.
 29. (canceled)
30. A method of speaker recognition that generates a likelihood that the same speaker generated a training signal and a test signal, the method comprising: generating a matched test signal and a matched training signal by performing a step for each of a plurality of frequency components, the step comprising adding to the strength of the frequency component in one of the test signal or training signal so that the mean strength of the frequency component of noise in the matched test signal matches the mean strength of the frequency component of noise in the matched training signal, wherein for some frequency components the step of adding to the strength of the frequency component in one of the test signal or training signal comprises adding to the strength of the frequency component in the training signal and for other frequency components the step of adding to the strength of the frequency component in one of the test signal or training signal comprises adding to the strength of the frequency component in the test signal; creating a model based on the matched training signal; and applying the matched test signal to the model to produce the likelihood that a same speaker generated the training signal and the test signal.
31. A method of identifying a speaker comprising: receiving a training speech signal; receiving a test speech signal; for each of a plurality of frequency components adding to the variance of the frequency component in one of the training speech signal or test speech signal so that the variance of the frequency component of noise in a matched training speech signal matches the variance of the frequency component of noise in a matched test speech signal; for each of a plurality of frequency components adding to the frequency component of one of the test speech signal or the training speech signal so that the mean strength of the frequency component of the noise in the matched test speech signal matches the mean strength of the frequency component of the noise in the matched training speech signal; generating a model from the matched training speech signal; and comparing the matched test speech signal to the model to identify the speaker.
32. A method of identifying a speaker comprising: receiving a training speech signal; receiving a test speech signal; receiving a second training speech signal; for each of a plurality of frequency components adding to the variance of the frequency component in one of the training speech signal or test speech signal so that the variance of the frequency component of noise in a matched training speech signal matches the variance of the frequency component of noise in a matched test speech signal, wherein adding to the variance of a frequency component in one of the training speech signal or test speech signal comprises: identifying the largest variance of the frequency component in the noise of the test speech signal, the noise of the training speech signal and the noise of the second training speech signal; and adding to the variance of the frequency component in one of the training speech signal or test speech signal so that the variance of the frequency component in the noise of the matched test speech signal matches the variance of the frequency component in the noise of the matched training speech signal; generating a model from the matched training speech signal; and comparing the matched test speech signal to the model to identify the speaker.